fix: adapt SDK ProxyConfiguration to crawlee v4 API #596
B4nan wants to merge 8 commits into fix/storage-client-v4-adapt
Crawlee v4 reshaped `ProxyConfiguration`:
- `newProxyInfo` and `newUrl` now take a single `TieredProxyOptions` argument; the previous `(sessionId, options)` pair is gone.
- The protected `_handleCustomUrl(sessionId)` helper was removed; the `_callNewUrlFunction` and `_handleTieredUrl` helpers now take options only.
- `ProxyInfo` (in `@crawlee/types`) no longer carries `sessionId`.

Changes:
- `newProxyInfo` and `newUrl` accept `string | number | TieredProxyOptions | undefined` so existing SDK callers that pass a raw `sessionId` keep working, while the override remains compatible with crawlee's v4 signature. A small `parseSessionIdOrOptions` helper discriminates and pulls `sessionId` from `options.request` when no explicit one is given.
- Inlined custom-URL session stickiness via a new private `getSessionIndex(sessionId)` (replacing the removed `_handleCustomUrl`), keyed on `usedProxyUrls` like the base class.
- Re-declared `sessionId?: string` on the SDK's `ProxyInfo` interface so users can still read `proxyInfo.sessionId` (v3 carried it on the base type).
- Re-imported `ProxyInfo` from `@crawlee/types` (no longer re-exported from `@crawlee/core`).
- Tightened a `proxyUrls.some(url => url.includes(...))` access for the new `(string | null)[]` array shape.

Stacked on #583 (config redesign); rebases onto v4 once that lands.
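A minimal sketch of how a helper like `parseSessionIdOrOptions` might discriminate its argument. The `RequestLike` and `OptionsLike` types are simplified stand-ins, not crawlee's real definitions, and converting a numeric ID to a string is an assumption about the desired behaviour rather than something the description states.

```typescript
// Hypothetical stand-ins for crawlee's Request / TieredProxyOptions (assumptions).
interface RequestLike {
    sessionId?: string;
}

interface OptionsLike {
    request?: RequestLike;
}

interface ParsedArgs {
    sessionId: string | undefined;
    options: OptionsLike | undefined;
}

// Accepts a raw session ID (legacy call shape) or an options object (v4 call shape).
function parseSessionIdOrOptions(arg?: string | number | OptionsLike): ParsedArgs {
    if (arg === undefined) {
        return { sessionId: undefined, options: undefined };
    }
    if (typeof arg === 'string' || typeof arg === 'number') {
        // Legacy shape: normalise the raw ID to a string (assumed behaviour).
        return { sessionId: String(arg), options: undefined };
    }
    // v4 shape: no explicit ID was given, so fall back to the request's session ID, if any.
    return { sessionId: arg.request?.sessionId, options: arg };
}
```

Both `parseSessionIdOrOptions('abc')` and `parseSessionIdOrOptions({ request: { sessionId: 'abc' } })` resolve to the same `sessionId` here, which is the compatibility property the description is after.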
- Custom URL rotation: post-increment the round-robin index so the
first sessionless call returns proxyUrls[0] (was off-by-one).
- Surface `username` on the returned ProxyInfo by parsing it out of
the resolved URL — v3 carried it via `super.newProxyInfo`.
- parseSessionIdOrOptions now rejects non-plain objects (e.g. Date,
Array) so `newUrl(new Date())` throws as users expect.
test: `newUrl({})` is no longer 'invalid' — empty TieredProxyOptions
is a legal v4 call shape; documented the carve-out.
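The two behavioural fixes in this commit can be sketched roughly as follows. `createRotator` and `isPlainObject` are hypothetical names used for illustration only; the SDK's actual implementation may be structured differently.

```typescript
// Hypothetical round-robin rotator over a fixed list of proxy URLs.
function createRotator(urls: string[]): () => string {
    let nextIndex = 0;
    return () => {
        // Post-increment: read the current index, then advance it,
        // so the first call returns urls[0] rather than urls[1].
        const url = urls[nextIndex++ % urls.length]!;
        return url;
    };
}

// Hypothetical guard: true only for object literals and null-prototype objects.
function isPlainObject(value: unknown): value is Record<string, unknown> {
    if (typeof value !== 'object' || value === null) return false;
    const proto: unknown = Object.getPrototypeOf(value);
    return proto === Object.prototype || proto === null;
}
```

Under this guard, `new Date()` and `[]` are rejected while `{}` passes, which lines up with the test carve-out for empty options objects.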
…oxyInfo shape
- newUrl/newProxyInfo accept an optional second `legacyOptions`
argument so existing callers that pass `(sessionId, {request})`
keep working under the v4 shape too.
- Returned ProxyInfo omits Apify-only fields (groups, countryCode)
when not using Apify Proxy and only includes `proxyTier` when
defined — matches v3's strict-deep-equal expectations.
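A sketch of why the fields are omitted rather than set to `undefined`: a strict deep-equality check typically treats `{ url, groups: undefined }` and `{ url }` as different objects, so optional fields have to be left off entirely. The `ProxyInfoSketch` shape and `buildProxyInfo` function are illustrative assumptions, not the SDK's real types.

```typescript
// Hypothetical, trimmed-down result shape; the real ProxyInfo has more fields.
interface ProxyInfoSketch {
    url: string;
    groups?: string[];
    countryCode?: string;
    proxyTier?: number;
}

interface BuildArgs {
    url: string;
    usesApifyProxy: boolean;
    groups: string[];
    countryCode?: string;
    proxyTier?: number;
}

function buildProxyInfo(args: BuildArgs): ProxyInfoSketch {
    const info: ProxyInfoSketch = { url: args.url };
    if (args.usesApifyProxy) {
        // Apify-only fields are attached only when Apify Proxy is in use.
        info.groups = args.groups;
        // Skipping an unset countryCode is a choice of this sketch.
        if (args.countryCode !== undefined) info.countryCode = args.countryCode;
    }
    // proxyTier appears as a key only when it has a value.
    if (args.proxyTier !== undefined) info.proxyTier = args.proxyTier;
    return info;
}
```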
…nfiguration tests
- ProxyInfo.username is now the decoded form (`user@name` rather than `user%40name`), matching v3 behaviour and the test expectations.
- Added a beforeEach to the `Actor.createProxyConfiguration()` describe that resets serviceLocator + Configuration.globalConfig + Actor._instance so each test sees the env vars it sets.
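For illustration, the decoded username can be obtained with the standard WHATWG `URL` API, which keeps percent-encoding in its `username` property. `usernameFromProxyUrl` is a hypothetical helper name and the proxy URL in the usage note is a placeholder.

```typescript
// Hypothetical helper: extracts the username from a proxy URL and decodes it.
function usernameFromProxyUrl(proxyUrl: string): string {
    // URL.username stays percent-encoded (e.g. "user%40name"),
    // so it is decoded explicitly before being returned.
    const { username } = new URL(proxyUrl);
    return decodeURIComponent(username);
}
```

With this sketch, `usernameFromProxyUrl('http://user%40name:secret@proxy.example.com:8000')` yields `user@name`.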
crawlee v4 (apify/crawlee#3599, beta.51) removed `tieredProxyUrls`, `tieredProxyConfig`, `_handleTieredUrl`, and `proxyTier` from `ProxyConfiguration` / `ProxyInfo`. The SDK's wrapper used to thread those through to the base class; with the upstream API gone, that plumbing has to go too.
- Remove the `tieredProxyConfig` field from the SDK's `ProxyConfigurationOptions`.
- Drop the constructor branch that forwarded `tieredProxyUrls` / `tieredProxyConfig` to the base class and the now-unreachable `_generateTieredProxyUrls` helper.
- Drop the `tieredProxyUrls` short-circuit and `proxyTier` field from `newUrl` / `newProxyInfo`.
- Drop the corresponding test groups in `proxy_configuration.test.ts`.
b490925 to 4f718b5
// `tieredProxyUrls` / `tieredProxyConfig` were removed from
// crawlee v4 (apify/crawlee#3599); the corresponding test groups
// were dropped here too.
nit: can we do away with these "gravestone" comments?
These are useful when reading the AI output, but I don't think we should commit these. Future maintainers won't care about the tests that are not here anymore.
sessionIdOrOptions?:
    | string
    | number
    | Parameters<CoreProxyConfiguration['newProxyInfo']>[0],
legacyOptions?: Parameters<CoreProxyConfiguration['newProxyInfo']>[0],
The removal of the sessionId parameter from Crawlee v4 was intentional, see this comment.
According to the new "UserPool" design, ProxyConfiguration shouldn't care for Session details (the resolved proxy URL is stored in a Session after it's retrieved and the ProxyConfiguration is not queried again).
Imo, each call to newProxyInfo in SDK should just return a random, valid proxy URL - that is, e.g., with random session IDs. This way, the SDK shields the user from the Apify Proxy session implementation.
proxyConfig.newProxyInfo() // { url: "http://session-131231@proxy.apify.com" }
proxyConfig.newProxyInfo() // { url: "http://session-234244@proxy.apify.com" }
proxyConfig.newProxyInfo() // { url: "http://session-342434@proxy.apify.com" }
// ...As far as the consumer is concerned, these are just opaque URLs.
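A rough sketch of what that suggestion could look like. The host, the `session-<id>` username layout, and the `newOpaqueProxyUrl` name are placeholders mirroring the example above; this is not the SDK's or Apify Proxy's actual URL format.

```typescript
// Hypothetical: each call mints a fresh random session ID and bakes it into the URL,
// so callers never have to supply or track a session ID themselves.
function newOpaqueProxyUrl(): string {
    // Random integer in [0, 1_000_000); uniqueness is not guaranteed in this sketch.
    const sessionId = Math.floor(Math.random() * 1000000);
    return `http://session-${sessionId}@proxy.example.com`;
}
```

A real implementation would likely want a collision-resistant ID; the point here is only that the session detail stays inside the returned URL.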
private getSessionIndex(sessionId: string): number {
    if (!this.usedProxyUrls.has(sessionId)) {
        this.usedProxyUrls.set(
            sessionId,
            this.proxyUrls![
                this.usedProxyUrls.size % this.proxyUrls!.length
            ],
        );
    }
    return this.proxyUrls!.indexOf(this.usedProxyUrls.get(sessionId)!);
}
What is the reason behind this? Perhaps we can remove this, given the returned urls should be independent?
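For context on the question, the snippet under review appears to implement session stickiness: the first time a session ID is seen it is assigned the next URL in round-robin order, and later calls with the same ID get the same URL back. A standalone sketch of that behaviour, with `StickyUrls` as a hypothetical class name rather than anything in the SDK:

```typescript
// Hypothetical standalone model of the sticky-session lookup.
class StickyUrls {
    private readonly proxyUrls: string[];
    private readonly usedProxyUrls = new Map<string, string>();

    constructor(proxyUrls: string[]) {
        this.proxyUrls = proxyUrls;
    }

    urlFor(sessionId: string): string {
        let url = this.usedProxyUrls.get(sessionId);
        if (url === undefined) {
            // New session: pick the next URL based on how many sessions exist so far.
            url = this.proxyUrls[this.usedProxyUrls.size % this.proxyUrls.length]!;
            this.usedProxyUrls.set(sessionId, url);
        }
        return url;
    }
}
```

If every returned URL is meant to be independent, as proposed earlier in the thread, this mapping would no longer be needed.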
Summary
Crawlee v4 reshaped `ProxyConfiguration`:
- `newProxyInfo` / `newUrl` now take a single `TieredProxyOptions` argument; the `(sessionId, options)` pair is gone.
- The protected `_handleCustomUrl(sessionId)` helper was removed. `_callNewUrlFunction` / `_handleTieredUrl` take options only.
- `ProxyInfo` (in `@crawlee/types`) no longer carries `sessionId`.

This PR adapts the SDK's override:
- `newProxyInfo` and `newUrl` accept `string | number | TieredProxyOptions | undefined`: existing SDK callers that pass a raw `sessionId` keep working, and the override is also compatible with crawlee's v4 single-options signature. A small `parseSessionIdOrOptions` helper discriminates and pulls `sessionId` from `options.request` when no explicit one is given.
- Inlined custom-URL session stickiness via a new private `getSessionIndex(sessionId)` (replacing the removed `_handleCustomUrl`), keyed on the inherited `usedProxyUrls` map.
- Re-declared `sessionId?: string` on the SDK's `ProxyInfo` interface so users can keep reading `proxyInfo.sessionId`.
- `ProxyInfo` is now imported from `@crawlee/types` (no longer re-exported from `@crawlee/core`).
- Tightened `.some(url => url.includes(...))` for the new `(string | null)[]` shape.

Stacking
Depends on #583 (config redesign). Rebases cleanly onto v4 once that lands.